The feature extraction process plays a fundamental role in
many  theoretical  treatments  of  auditory  pattern  recognition.  At 
some  early  stage  in  the  recognition  process,  the  perceptual 
representation  of  a stimulus  is  broken  down  into  a set  of 
elementary  properties  or  characteristics.  The  central  role  of 
this  stage  can  be  seen  in  Figure  1. 


Insert  Figure  1 here 



In  this  characterization  of  the  pattern  recognition  process,  the 
preliminary  analysis  stage  produces  a relatively  unprocessed 
representation  of  an  incoming  stimulus.  At  this  point  the 
representation  is  thought  to  contain  considerable  noise  and 
redundancy.  The  output  of  this  stage  is  then  transformed  by  the 
feature  extraction  stage  into  a relatively  small  set  of 
distinctive  features — the  basic  building  blocks  of  the 
recognition  process.  As  Anderson,  Silverstein,  Ritz,  and  Jones 
(1977)  have  noted,  "Distinctive  features  are  usually  viewed  as  a 
system  for  efficient  preprocessing,  whereby  a noisy  stimulus  is 
reduced  to  its  essential  characteristics  and  a decision  is  made 
on these" (p. 429).

Quite  clearly,  feature  extraction  involves  information 
reduction.  Some  information  in  a pattern  is  retained  while  other 
information  is  discarded.  Ideally,  the  set  of  distinctive 
features  should  uniquely  specify  the  stimulus,  preserving  or 
enhancing  perceptually-important  differences  among  stimuli,  and 
reducing or eliminating perceptually-unimportant differences.


Figure 1. Flow diagram of a four-stage pattern recognition model.

The significance of feature extraction in auditory recognition
should  be  obvious.  Since  the  feature  representation  is  efficient 
both  in  terms  of  dimensionality  and  redundancy,  the  subsequent 
decision  process  can  be  undertaken  with  minimal  effort  and 
optimal  reliability.  On  the  other  hand,  an  ineffective  set  of 
distinctive  features  not  only  can  increase  the  amount  of 
subsequent  processing  required,  but,  by  definition,  will  also 
make  satisfactory  performance  impossible. 

Despite  its  central  importance  as  a theoretical  construct, 
the  feature  extraction  process  has  not  been  well-specified  in  the 
literature.  No  true  psychological  theory  of  feature  extraction 
exists.  When  we  say  that  a stimulus  is  reduced  to  its  "essential 
elements,"  what  do  we  mean?  How  are  these  crucial  elementary 
units  determined?  Implicit  in  the  above  discussion  is  the 
assumption  that  a feature  tuning  process  exists  whereby  a set  of 
distinctive  features  is  defined.  In  this  presentation  we  focus 
on  possible  mechanisms  that  underlie  the  feature  tuning  or 
feature  selection  process  in  human  auditory  recognition. 

Feature  Selection  Processes 

As  outlined  above,  the  feature  selection  problem  involves 
picking  a set  of  distinctive  features  from  the  vast  set  of  all 
possible  features.  The  problem  seems  clear  in  the  case  of  a 
statistician  who  is  constructing  an  algorithm  to  classify  a set 
of  acoustic  patterns.  What  acoustic  cues  or  combinations  of 
acoustic  cues  should  be  considered?  Indeed,  the  ideal  features 
may  be  some  complex  function  of  a number  of  more  primitive 
spectral  measurements.  Given  a set  of  preliminary  measurements 
and the desired categorization of the stimuli, the statistician
must select a set of distinctive features bearing in mind both
performance and economic (i.e., "how much will the feature
extraction process cost?") considerations (cf. Meisel, 1972).

The  feature  selection  problem  for  the  psychologist  has  much 
in  common  with  that  of  the  statistician.  However,  instead  of 
actually  determining  a set  of  "efficient"  distinctive  features 
for  human  auditory  recognition  (although  this  may  be  of  interest 
in some applied contexts), it is our task to identify the
features  or  the  feature  selection  process  that  human  listeners 
actually  use. 

Although  a number  of  specific  "natural"  auditory  feature 
selection  processes  can  be  proposed,  two  contrasting  views  are 
implicit  in  the  literature.  The  first  possibility  is  that  Nature 
has  selected  an  optimal  set  of  distinctive  features  through 
natural  selection  and  built  specific  mechanisms,  finely  tuned  to 
detect  these  features,  into  our  auditory  systems.  The  second 
view  affords  more  flexibility.  Perhaps  Nature  has  built  the 
feature  selection  process  into  our  auditory  systems.  In  other 
words,  we  may  have  internalized  a set  of  rules  and  processes  that 
enable  us  to  establish  what  the  distinctive  features  should  be  in 
any  particular  stimulus  context.  Both  views  are  considered  in 
more  detail  below. 

The  property-list  approach.  The  first  view  argues  that  man 
is  equipped  with  a set  of  specific  feature  detecting  mechanisms. 
In  terms  of  auditory  pattern  recognition,  this  approach  places  an 
emphasis  on  the  feature  detectors  themselves.  An  auditory 
feature extractor is, from this perspective, a filter-like device
that monitors the incoming stream of sensory information for
particular stimulus properties. In short, each detector is tuned
to "look for" a particular stimulus property, and a set of
feature detectors determines a property list for the stimulus. In
the  extreme,  this  seems  consistent  with  previous 
neurophysiological  investigations  of  single  unit  responding  in 
sensory  systems.  The  first  relevant  evidence  emerged  many  years 
ago  from  the  seminal  work  of  Lettvin,  Maturana,  McCulloch,  & 
Pitts  (1959)  with  the  frog's  visual  system.  Their  pioneering 
research  revealed  the  presence  of  highly  selective  neurons  in  the 
visual  periphery  tuned  to  select  stimuli  of  particular  relevance 
for  the  animal's  survival.  Moving  edge,  overall  illumination, 
and  the  highly-popularized  "bug"  detectors  were  among  those 
discovered.  It  seems  that  Nature  equipped  the  frog  with  special 
detectors  for  virtually  everything  it  needs  to  know  about  its 
visual  world.  Similar  work  soon  followed,  investigating  feature 
detectors in both the visual (Hubel & Wiesel, 1962) and auditory
systems  (Whitfield  & Evans,  1965).  This  research  stimulated 
considerable  speculation  about  hierarchical  decision  mechanisms 
where  feature  information  is  combined  and  re-combined,  ultimately 
leading to classification (cf. Weisstein, 1973). When extended,
this  line  of  reasoning  leads  us  to  Sherrington's  (1941)  notion  of 
a supreme,  "pontifical"  cell  whose  response  signals  the  presence 
of  a particular  complex  pattern.  Although  he  opted  for  a more 
democratic  system  of  "cardinal"  cells  in  place  of  the  all-knowing 
pontiff,  Barlow  (1972)  succinctly  summarized  this  approach  in  his 
specification of a "neuron doctrine for perceptual psychology."

A good  example  of  this  sort  of  system  in  human  audition  is 
the  set  of  distinctive  features  and  feature  detectors 
hypothesized  to  underlie  human  speech  perception  (Fant,  1973). 
Here  a relatively  small  number  of  distinctive  features  have  been 
described  that  may  be  used  to  uniquely  characterize  individual 
phonemes.  A voicing  detector,  for  example,  would  monitor  the 
speech  stream  for  cues  that  distinguish  between  voiced  and 
unvoiced  stop  consonants.  In  an  initial  study,  Eimas  & Corbit 
(1973)  used  a psychophysical  procedure  to  obtain  evidence  for  the 
existence  of  voicing  detectors,  finely  tuned  to  a relatively 
narrow range of voice onset times (i.e., formant onset
asynchronies in the speech signal). More recent work has
generalized  their  findings  to  include  the  psychophysical 
investigation  of  a variety  of  linguistic  as  well  as 
non-linguistic  feature  detectors  in  the  human  auditory  system 
(Cooper, 1975).

The process-oriented approach. The second alternative
assumes  that  man  has  internalized  the  feature  selection  process 
itself.  In  contrast  to  the  property-list  approach,  the  important 
"features"  in  a complex  sound  reflect  whatever  structure  exists 
in the output of the preliminary analysis stage. In this sense,
Nature has endowed us with a set of rules and criteria for
feature  selection  rather  than  with  highly-tuned  detection 
mechanisms. 

In  arguing  that  the  feature  selection  process  is  built  in, 
we  necessarily  assume  that  certain  general  principles  exist  that 
can characterize feature selection across a variety of stimulus
and task conditions. These invariants include both the selection
criteria employed and a mechanism for applying them. In the
present  discussion,  we  assume  that  the  feature  selection  process 
attempts  to  reduce  the  dimensionality  of  the  stimulus 
representation  while  preserving  as  much  of  the  stimulus  structure 
as  possible.  An  example  of  this  approach  may  be  seen  in 
Wightman's  (1973)  pattern  transformation  model  of  periodicity 
pitch  perception.  The  model  assumes  that  the  auditory  system 
performs  two  successive  Fourier  transforms  (equivalent  to  an 
autocorrelation  in  the  time  domain)  to  extract  periodicity 
information  from  a complex  tone.  Since  signal  periodicity 
reflects  the  frequency  relations  among  individual  spectral 
components,  the  proposed  transformation  analyzes  the  relational 
structure  of  the  stimulus.  Uttal  (1975)  has  outlined  a similar 
autocorrelational  model  for  visual  pattern  detection. 
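
The equivalence mentioned above can be illustrated with a short numerical sketch. It is not Wightman's model itself, only a minimal demonstration that transforming a signal to its power spectrum and then transforming the power spectrum again yields the autocorrelation function, whose first major peak marks the period. The sample rate, the search range, and the three-harmonic test tone are illustrative assumptions, not values from the studies cited.

```python
import numpy as np

def autocorrelation_via_fft(signal):
    """Autocorrelation obtained from two successive Fourier transforms."""
    n = len(signal)
    # First transform: zero-pad to 2n so the result is a linear
    # (not circular) autocorrelation, then take the power spectrum.
    power = np.abs(np.fft.rfft(signal, n=2 * n)) ** 2
    # Second transform: for a real, even power spectrum the inverse
    # transform differs from the forward one only by a scale factor.
    acf = np.fft.irfft(power, n=2 * n)[:n]
    return acf / acf[0]

def estimate_period(signal, sample_rate, min_freq=50.0, max_freq=1000.0):
    """Return the lag (in seconds) of the largest autocorrelation peak."""
    acf = autocorrelation_via_fft(signal)
    lo = int(sample_rate / max_freq)   # shortest lag considered
    hi = int(sample_rate / min_freq)   # longest lag considered
    lag = lo + int(np.argmax(acf[lo:hi + 1]))
    return lag / sample_rate

# Hypothetical test tone: harmonics 3, 4, and 5 of 200 Hz, with no
# energy at the fundamental itself.
sample_rate = 16000
t = np.arange(0, 0.1, 1.0 / sample_rate)
tone = sum(np.sin(2 * np.pi * 200 * k * t) for k in (3, 4, 5))
pitch_hz = 1.0 / estimate_period(tone, sample_rate)
```

Although the tone contains no component at 200 Hz, the autocorrelation peaks at the 5-msec lag common to all three harmonics, which is the sense in which the transformation analyzes relations among spectral components rather than the components themselves.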

Implications  for  auditory  perception.  The  intention  of  this 
discussion  is  not  to  suggest  that  one  or  the  other  of  these  views 
is  necessarily  correct — at  this  point,  the  most  nearly  correct 
view  would  seem  to  include  elements  of  both  perspectives. 
Rather,  we  consider  these  particular  approaches  because  they 
occupy  opposite  ends  on  a continuum  of  feature  selection 
flexibility.  This  difference  has  a number  of  implications  for 
auditory  recognition  theory,  and  in  particular,  for  recognition 
processes  in  the  perception  of  the  timbre  of  complex  sounds.  As 
Plomp  (1976)  has  noted,  timbre  is  generally  defined  as  "that 
attribute  of  auditory  sensation  in  terms  of  which  a listener  can 
judge that two steady-state complex tones having the same
loudness and pitch, are dissimilar" (pp. 85-86). In other
words,  timbre  is  everything  left  over  after  we  take  away  loudness 
and  pitch.  Quite  clearly,  timbre  does  not  describe  a single 
perceptual  attribute  of  sound,  but  rather,  it  represents  a family 
of  perceptual  attributes.  One  would  be  entirely  at  ease  in 
reporting  that  a complex  sound  has  a high  pitch  or  is  very  loud, 
but  to  attempt  a simple  description  of  this  sort  for  timbre  would 
seem  ridiculous.  If  asked  to  discuss  the  timbre  of  a sound,  the 
listener  would  likely  resort  to  a number  of  adjectives,  "it  is 
coarse,  pleasant,  bright,  etc."  (von  Bismarck,  1974).  But  even 
given this verbal flexibility, the listener would find it
difficult to adequately describe the timbre of a complex tone.

If  one  were  to  adopt  a property-list  approach  to  the  feature 
extraction  stage  for  timbre  perception,  it  would  be  necessary  to 
specify a list of important timbre attributes. However, it [...]
the feature extraction process. In all three experiments, the
features that listeners use in perceiving the timbre of complex
sounds were investigated. The results of the first two
experiments  serve  to  emphasize  the  importance  of  the  stimulus 
population  in  determining  the  timbre  attributes  that  listeners 
use  in  comparing  complex  sounds.  The  findings  of  the  third 
experiment  illustrate  the  role  of  task  factors  in  the  feature 
extraction  process. 

Stimulus  Effects  in  Feature  Extraction 

In  the  above  discussion  we  argued  that  the  feature 
extraction  process  in  timbre  perception  may  be  more  appropriately 
viewed  as  a structure  analyzing  process  than  as  a feature 
detection  process.  In  order  to  evaluate  this  hypothesis 
empirically  it  is  necessary  to  examine  the  output  of  the  feature 
extraction  stage,  and  to  relate  this  feature  representation  to 
the  known  properties  of  its  input.  Although  the  feature 
representation  is  obviously  not  directly  observable,  it  may  be 
inferred  using  a variety  of  psychophysical  procedures.  In 
particular,  multidimensional  scaling  has  emerged  as  a useful 
method  for  identifying  the  underlying  psychological  structure  of 
complex sounds (Plomp, 1976). Typically, listeners are asked to
provide  pairwise  or  triadic  dissimilarity  judgments  on  the  set  of 
signals  of  interest.  A specific  multidimensional  scaling 
algorithm  is  then  applied  to  decompose  the  resulting  subjective 
proximity  matrix  into  an  n-dimensional  metric  space  in  which  each 
stimulus is represented as a single point or vector. Providing
that an interpretable solution with satisfactory stress exists,
it is generally assumed that the dimensions in the psychological
stimulus space reflect those features that the listeners employed
in comparing the stimuli. In other words, it is at least
implicitly assumed that the scaling methods provide an
approximate inverse to the later stages of the recognition
process (cf. Figure 1). If we are willing to make certain
assumptions about the information available after the preliminary
analysis stage, then we have the input/output information
necessary to speculate about the feature extraction process.
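
To make the scaling step concrete, the following sketch decomposes a dissimilarity matrix into a low-dimensional configuration of points. It uses classical (Torgerson) metric scaling, chosen here only because it is compact; the studies discussed may well have used other algorithms, and the stress criterion mentioned above belongs to iterative procedures that this sketch does not implement. The five "stimuli" are hypothetical points invented for the demonstration.

```python
import numpy as np

def classical_mds(dissimilarities, n_dims=2):
    """Classical metric scaling of a symmetric dissimilarity matrix."""
    d = np.asarray(dissimilarities, dtype=float)
    n = d.shape[0]
    # Double-center the squared dissimilarities to obtain a matrix of
    # scalar products among the (unknown) stimulus points.
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering
    # The leading eigenvectors, scaled by the root eigenvalues, give
    # the stimulus coordinates.
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_dims]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale

def pairwise_distances(points):
    diffs = points[:, None, :] - points[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

# Hypothetical "true" two-dimensional stimulus configuration.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0], [1.0, 1.0]])
dissim = pairwise_distances(points)

coords = classical_mds(dissim, n_dims=2)
recovered = pairwise_distances(coords)
```

The recovered configuration reproduces the interpoint distances, but only up to rotation and reflection of the axes. This is one reason the interpretability of the dimensions has to be argued separately from the goodness of fit.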

In  the  first  systematic  application  of  these  methods  to 
audition,  Plomp  and  his  associates  (Plomp,  1970)  compared  the 
timbre  properties  of  nine  musical  instrument  sounds  to  their 
corresponding  spectral  structure.  A three-dimensional  stimulus 
space  was  revealed  in  a multidimensional  scaling  analysis  of  the 
subjective  similarity  data  for  these  stimuli.  The  configuration 
of  interstimulus  distances  in  this  perceptual  space  correlated 
highly  with  the  corresponding  distances  in  a three-dimensional 
physical  space  obtained  in  a physical  analysis  of  these  sounds. 
Although  Plomp  was  primarily  interested  in  determining  whether  a 
correlation  existed,  for  our  present  purposes,  the  specific 
methods  used  to  obtain  the  physical  space  are  of  particular 
interest. 

Specifically,  the  physical  analysis  was  based  on  information 
that  could  be  reasonably  thought  to  approximate  that  available  to 
the  auditory  feature  extraction  process.  Recognizing  the  limited 
spectral  analyzing  ability  of  the  human  auditory  system  (Zwicker, 
Flottorp, & Stevens, 1957), Plomp obtained 1/3-octave band-level
measurements for each of his complex sounds. A
principal-components  analysis  was  then  performed  on  these 
spectra.  The  results  of  this  analysis  revealed  that  each  of  the 
nine  stimuli  could  be  characterized  in  terms  of  three  spectral 
attributes  with  very  little  loss  of  information.  If  we  are 
willing  to  assume  that  the  1/3-octave  spectrum  approximates  the 
output  of  the  preliminary  analysis  stage  depicted  in  Figure  1, 
then  these  findings  suggest  that  the  listener's  feature 
extraction  process  may  be  somewhat  similar  to  a 
principal-components  analysis. 

This  conclusion  is  entirely  consistent  with  our  hypothesis 
that the feature extraction stage in timbre perception involves a
structural  analysis  of  the  stimuli.  More  specifically,  the 
principal-components  analysis  may  be  thought  of  as  a structure 
preserving  transformation  that  maps  stimuli  from  one  space  into 
another of lower dimensionality. The first principal component
is simply a new axis in the original space (in this case the
measurement space spanned by the 1/3-octave band-levels) that
accounts for as much of the variability in the data as any single
axis can. In other words, the set of projections of stimuli in the
measurement space onto the first principal component has maximum
variance. The second principal component is the axis orthogonal to
the first that accounts for as much of the residual variance as
possible, and so on. [In
practice,  the  principal  components  are  determined  by  selecting 
the  eigenvectors  of  the  covariance  matrix  for  the  stimulus  set 
that  correspond  to  the  largest  eigenvalues  (Harris,  1975)]. 
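
The eigenvector computation just described can be sketched in a few lines. The data below are synthetic: nine hypothetical "spectra" of fifteen band levels each, built from three latent spectral shapes plus a small amount of noise. The number of bands, the level values, and the noise scale are assumptions made for illustration and are not Plomp's measurements.

```python
import numpy as np

def principal_components(band_levels, n_components):
    """Project stimuli onto the leading eigenvectors of their covariance matrix."""
    x = np.asarray(band_levels, dtype=float)
    centered = x - x.mean(axis=0)
    # Covariance among measurement dimensions (columns are band levels).
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eigh returns ascending eigenvalues; reorder from largest to smallest.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    axes = eigvecs[:, :n_components]
    scores = centered @ axes
    return scores, axes, explained

rng = np.random.default_rng(0)
latent = rng.normal(size=(9, 3))      # three underlying attributes per stimulus
shapes = rng.normal(size=(3, 15))     # how each attribute shapes the spectrum
spectra = 60.0 + 5.0 * latent @ shapes + rng.normal(scale=0.1, size=(9, 15))

scores, axes, explained = principal_components(spectra, n_components=3)
```

Here `scores` holds the coordinates of each stimulus in the reduced three-dimensional space and `explained` gives the proportion of variance associated with each component. Because the synthetic spectra were built from three latent shapes, the first three components carry nearly all of the variance.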

In  this  view,  then,  the  feature  extraction  process  selects  a 
subspace of the original space that preserves as much of the
variability among stimuli as possible. It is clear that these
features (i.e., the principal components) reflect the structure
of  the  stimulus  set,  and  therefore  we  would  expect  the  important 
perceptual  features  to  vary  dramatically  depending  on  the 
stimulus  context. 

This  finding  is  given  some  generality  by  a similar  result 
obtained  in  our  laboratory.  In  our  experiment  listeners  were 
asked  to  rate  the  pairwise  similarity  of  eight  passive  sonar 
recordings.  Two  perceptual  dimensions  were  extracted  from  these 
data using a metric multidimensional scaling procedure (Howard,
1977). The results were then compared with the outcome of a
physical  analysis  of  the  stimuli  that  paralleled  the  analysis 
described  above.  The  1/3-octave  spectrum  was  obtained  for  each 
of  the  eight  sounds.  These  data  were  then  submitted  to  a 
principal-components  analysis.  Since  most  of  the  variability 
could  be  accounted  for  by  the  first  principal  component,  it  was 
concluded  that  the  steady-state  characteristics  of  these  sounds 
could  be  adequately  summarized  by  a single  measurement.  This 
derived  physical  attribute  closely  approximated  one  of  the 
perceptual dimensions obtained in the scaling analysis. (The
other perceptual dimension revealed in the scaling analysis
reflected a temporal property of the sounds and is not directly
relevant  to  this  discussion.)  A closer  examination  of  the 
specific  signal  values  on  this  extracted  dimension  suggested  that 
it  summarized  the  overall  shape  of  the  spectra,  and  in 
particular,  the  degree  of  bimodality  of  the  spectra.  When  asked 
to describe stimulus differences along this dimension, listeners
used  such  terms  as  "this  one  is  more  uniform"  or  "in  this  one 
there  seems  to  be  more  than  one  sound  present." 

In  this  experiment,  as  in  Plomp's,  it  appears  that  a 
structure  preserving  transformation  reasonably  approximates  the 
analysis  performed  by  the  feature  extraction  stage.1  Similar 
findings  have  also  been  reported  for  the  analysis  of  steady-state 
vowel  spectra  (Klein,  Plomp,  & Pols,  1970).  Although  these 
experiments  were  conducted  for  another  purpose,  the  findings  are 
generally  consistent  with  the  present  hypothesis  that  the  feature 
extraction  stage  for  timbre  perception  is  best  viewed  as  a 
structure  analyzing  process.  Since  the  principal  components  are 
simply  weighted  linear  combinations  of  the  more  basic 
measurements  (in  this  case  the  1/3-octave  band-levels),  it  is 
clear  that  we  could  also  develop  a weighted  property-list  scheme 
to  account  for  these  findings.  In  such  a system,  the  listener 
would  adjust  the  measurement  weights  to  develop  "features"  that 
maximally  discriminate  among  the  stimuli.  Nonetheless,  the 
objectives  of  this  system  are  more  naturally  discussed  in  terms 
of  the  structure  analyzing  approach. 

The  major  emphasis  of  the  above  discussion  is  that  feature 
extraction  is  both  efficient,  in  a dimensionality  reducing  sense, 
and  flexible,  in  that  it  readily  adapts  to  the  stimulus  context. 
We  have  argued  that  a principal-components  analysis  shares  these 
characteristics  and  is  therefore  a possible  model  for  the  feature 
extraction  process  in  timbre  perception.  We  would  like  to 
emphasize,  however,  that  a variety  of  other  structure  preserving 
transformations are also adequate. For example, we could select
a multidimensional scaling algorithm that would reduce the
dimensionality while preserving the configuration of
inter-stimulus distances in the measurement space.

Task  Effects  in  Feature  Extraction 

In  the  experiments  outlined  above,  listeners  were  simply 
required  to  evaluate  relative  stimulus  similarity.  In  this 
situation  there  are  no  correct  or  incorrect  judgments.  It 
therefore  seems  reasonable  that  listeners  would  employ  a feature 
extraction  transformation  that  preserves  as  much  of  the  spectral 
information in the stimuli as possible. It is obvious, however,
that  in  a classification  task  where  performance  is  evaluated  in 
terms  of  external  criteria,  the  requirements  of  the  feature 
extraction  process  would  be  quite  different.  As  Figueiredo 
(1976)  has  pointed  out,  the  performance  of  the  entire  system  must 
be  considered  when  selecting  the  optimum  features  in  this 
situation.  We  have  already  seen  in  Getty's  paper  (Getty,  Swets, 
Swets,  & Green,  in  press)  that  observers  emphasized  different 
features  in  a visual  classification  task  than  they  did  in  a 
visual  comparison  task.  The  experiment  described  below  (Howard, 
Ballas, & Burgy, 1978) demonstrates a similar effect in an aural
classification  situation,  and  illustrates  the  role  of  task 
factors  in  feature  extraction. 

The  stimuli  investigated  in  this  study  consisted  of  sixteen 
broadband  noise  signals  amplitude  modulated  by  sawtooth  waves  of 
varying  frequency  and  attack.  Four  levels  of  modulation 
frequency  (4,  5,  6,  and  7 Hz)  and  four  combinations  of 
attack/decay (20 and 40 msec) were used. Eight listeners learned
to  classify  the  sixteen  signals  on  the  basis  of  one  of  two 
eight-category  partitions.  The  two  partitions  were  selected  to 
emphasize  one  or  the  other  dimension  by  requiring  listeners  to 
discriminate  among  all  four  levels  of  this  dimension  and  only  two 
levels  of  the  other  dimension.  The  two  partitions  are  presented 
schematically  in  Figure  2.  Here  we  have  labeled  the  perceptual 
dimension corresponding to attack "Quality," and the perceptual
dimension  corresponding  to  modulation  frequency  "Tempo." 
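
For readers who wish to hear stimuli of this general kind, the sketch below generates sixteen amplitude-modulated noise signals. It is a reconstruction under several assumptions that the text does not settle. The four attack/decay combinations are taken to be the crossings of 20- and 40-msec linear rise times with 20- and 40-msec linear fall times, the envelope is assumed to be zero for the remainder of each modulation cycle, and the sample rate, duration, and noise distribution are arbitrary choices.

```python
import numpy as np

def sawtooth_envelope(n_samples, sample_rate, mod_freq, attack_ms, decay_ms):
    """Assumed envelope: linear rise, linear fall, then silence until the next cycle."""
    t = np.arange(n_samples) / sample_rate
    phase_ms = (t % (1.0 / mod_freq)) * 1000.0      # time within the current cycle
    rise = phase_ms / attack_ms                      # 0 to 1 over the attack
    fall = 1.0 - (phase_ms - attack_ms) / decay_ms   # 1 to 0 over the decay
    return np.clip(np.minimum(rise, fall), 0.0, 1.0)

def make_stimulus(mod_freq, attack_ms, decay_ms,
                  duration=1.0, sample_rate=16000, seed=0):
    """Broadband (uniform white) noise multiplied by the assumed envelope."""
    rng = np.random.default_rng(seed)
    n = int(duration * sample_rate)
    noise = rng.uniform(-1.0, 1.0, size=n)
    return noise * sawtooth_envelope(n, sample_rate, mod_freq, attack_ms, decay_ms)

mod_freqs = (4, 5, 6, 7)                                     # Hz, from the text
shapes = [(a, d) for a in (20, 40) for d in (20, 40)]        # msec, assumed crossing
stimuli = {(f, a, d): make_stimulus(f, a, d)
           for f in mod_freqs for (a, d) in shapes}
```

Each key of `stimuli` identifies one signal by its modulation frequency (the Tempo dimension) and its assumed attack and decay times (the Quality dimension).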


Insert  Figure  2 here 


Clearly,  the  features  are  not  of  equal  importance  in  the  two 
partitions.  Listeners  in  the  Quality  group  were  required  to 
discriminate  relatively  small  differences  in  attack,  whereas 
listeners  in  the  Tempo  group  were  required  to  discriminate 
relatively  small  differences  in  modulation  frequency.  The 
confusion  data  from  this  experiment  were  analyzed  in  terms  of  a 
probabilistic  model  of  the  classification  process. 

Our  model  assumes  that  the  decision  stage  operates  on  the 
output  of  the  feature  extraction  process  and  classifies  stimuli 
so as to maximize the probability correct (cf. Howard et al.,
1978).  An  important  assumption  of  the  model  is  that,  with 
feedback,  listeners  perform  a selective  tuning  of  their  feature 
extraction  processes.  Theoretically,  the  tuning  process 
determines  a weighting  factor  for  each  of  the  two  features.2 
Selective tuning occurs when the weighting factor for one
dimension increases relative to the other.

Figure 2. Schematic representation of two partitions of the
sixteen signals into eight categories.

Weighting  parameters  for  each  feature  were  estimated  for 
individual  listeners  by  fitting  the  model  to  the  observed 
confusion  matrices  using  a standard  gradient  technique.  The 
weights  obtained  for  each  practiced  listener  are  displayed  in 
Figure  3. 


Insert  Figure  3 here 


It  is  evident  that  our  listeners  responded  to  the  demands  of 
their  classification  task.  Listeners  in  the  Tempo  group  had 
greater  weights  for  signal  Tempo  than  signal  Quality,  whereas  the 
opposite  was  true  for  the  Quality  group.  It  may  also  be  noted 
that  no  individual,  with  the  possible  exception  of  listener  PH, 
maximized  the  weights  for  both  features  simultaneously.  We 
interpreted  this  finding  to  suggest  that,  at  least  in  the  context 
of  this  experiment,  the  total  amount  of  feature  tuning  that  can 
occur  at  any  point  is  limited.  Since  we  assumed  that  feature 
tuning  reflects  the  operation  of  a selective  attentional 
mechanism,  this  interpretation  is  consistent  with  recent  limited 
capacity views of human attention (e.g., Kahneman, 1973).

Figure 3. Estimated weighting parameters for each feature by
individual listener and group.

Given an overall limit on the amount of feature tuning that
can occur, we wondered what strategy the listeners used to
determine how much emphasis to place on each feature. In our
decision model we assumed that listeners attempt to maximize the
probability correct. Therefore, we investigated the possibility
that the feature weights were also determined on this basis by
estimating the theoretically optimal weights for each practiced
listener.  In  computing  these  values,  the  sum  of  the  two  observed 
weights  was  taken  to  reflect  the  overall  attentional  effort 
expended  by  each  listener.  This  overall  value  was  then 
apportioned  between  the  two  features  so  as  to  maximize  the 
average  probability  correct.  The  normalized  obtained  and  optimal 
weights  reflecting  the  relative  importance  of  signal  Tempo  for 
each  listener  are  displayed  in  Figure  4. 


Insert  Figure  4 here 


Although  in  general  the  obtained  weights  are  reasonably  well 
approximated  by  the  optimal  values  (the  overall  Pearson 
product-moment correlation, r(15) = .98), a small but consistent
discrepancy  is  evident.  Six  of  our  eight  listeners  showed  a 
tendency  to  overemphasize  the  more  important  of  the  two  features. 

Figure 4. Normalized obtained and optimal weights for the Tempo
feature by individual listener and group.

Nonetheless, it is clear from these findings that listeners
show considerable flexibility in the emphasis they place on
individual perceptual features in an aural classification task.
With practice, feature tuning processes increase the importance
of both features. More importantly, for experienced listeners
the tuning process appears to operate selectively with relative
feature emphasis determined by a strategy that attempts to
maximize the overall probability correct. Of further interest is
a comparison of these findings with the results of two
multidimensional scaling studies involving the stimuli described
above. In the first, an independent group of 30 listeners rated
the pairwise similarity of all sixteen signals. A
multidimensional  scaling  analysis  of  these  data  revealed  that  22 
of  the  30  participants  placed  a greater  emphasis  on  signal 
Quality  than  Tempo.  In  the  second  experiment,  the  observers  from 
the  above  classification  experiment  rated  the  pairwise  similarity 
of  the  eight  stimulus  categories  they  had  learned.  In  this  case, 
only the category labels were provided; no signals were actually
presented. The data were decomposed into a "conceptual" space
where the dimensions corresponded to modulation frequency (i.e.,
Tempo) and attack (i.e., Quality). Although we had expected the
subjective importance of these dimensions to be strongly
influenced by the prior classification training (i.e., Tempo more
important  for  the  Tempo  group  and  Quality  more  important  for  the 
Quality  group),  listeners  in  both  groups  placed  a greater 
relative  emphasis  on  signal  Quality.  This  finding  is  more  in 
line  with  the  results  of  the  first  scaling  study  than  with  the 
results  of  the  classification  study.  It  appears,  therefore,  that 
in  the  similarity  rating  task,  listeners  tended  to  emphasize 
Quality  more  than  Tempo,  whereas  in  the  classification  task  they 
emphasized  the  dimension  that  would  lead  to  optimal  performance. 
This  result  further  illustrates  the  importance  of  task  factors  in 
determining  the  relative  subjective  importance  of  stimulus 
features. 

Conclusion 

As  indicated  in  the  introduction,  the  feature  extraction 
process  plays  a central  role  in  most  theoretical  approaches  to 
auditory  pattern  recognition.  Incoming  stimuli  are  analyzed  in 
terms of a set of distinctive or characteristic features that
form the basis for all subsequent perceptual processing.
Nevertheless, relatively little research has focused on the
selection and tuning processes whereby these essential perceptual
properties of stimuli are defined. In this paper we have
considered two contrasting views of the feature selection
process. On the one hand, it is possible that finely-tuned
stimulus  property  analyzers  exist  in  the  human  auditory  system. 
As  Fant  (1973)  has  pointed  out,  this  property-list  or 
detector-oriented  approach  is  by  definition  context  free.  Some 
property  analyzers  will  respond  to  a specific  stimulus  whereas 
others  will  not.  In  contrast,  it  is  also  possible  that  the 
feature  extraction  process  is  highly  context-sensitive.  In  this 
latter  process-oriented  approach,  feature  selection  is  viewed  as 
a continuous  on-going  process.  The  distinctive  features  used  to 
characterize  a particular  sound  emerge  from  a structural  analysis 
of  the  more  basic  psychophysical  measurements  obtained  by  the 
auditory  system. 

Overall,  the  findings  outlined  above  appear  consistent  with 
the  more  flexible,  process-oriented  approach  to  feature 
selection.  In  the  comparative  judgment  task  listeners  are 
required  to  evaluate  stimulus  similarity.  Since  no  specific 
comparison  criteria  are  typically  indicated,  listeners  need  only 
know  something  of  the  structure  of  the  stimulus  set  in  order  to
make  their  judgments.  The  features  identified  in  the  similarity
rating  experiments  outlined  above  generally  reflect  the  spectral
characteristics  of  the  stimuli.  We  argued  that  in  these  studies,
the  feature  extraction  process  is  most  naturally  viewed  as  an
on-going  structural  analysis  of  the  low-resolution  spectra 
extracted  from  the  stimuli  by  peripheral  auditory  mechanisms. 
However,  in  the  case  of  a classification  task,  a simple 
structural  analysis  of  the  stimulus  set  is  not  sufficient.  In 
this  situation,  the  assignment  of  stimuli  to  categories 
effectively  changes  the  important  structural  properties  of  the 
stimulus  set.  Particular  partitions  of  the  stimuli  serve  to 
emphasize  some  stimulus  relations  while  de-emphasizing  others. 
We  have  argued  that  in  this  sort  of  task  an  additional  feature 
tuning  process  occurs  that  adjusts  the  relative  emphasis  placed 
on  the  important  structural  features  to  accommodate  the  external 
task  constraints.  Furthermore,  the  results  of  our  classification 
study  suggested  that  the  feature  tuning  process  operates  on  a 
limited-capacity  basis,  and  that  the  fine-grained  adjustment  of 
feature  emphasis  is  based  on  a strategy  that  attempts  to  optimize 
the  probability  correct.  These  findings  emphasize  the  importance 
of  overall  performance  considerations  in  the  feature  extraction 
process.  In  this  sense,  distinctive  auditory  features  are  tuned 
not  only  to  the  stimuli,  but  also  to  the  decision  rule  employed 
by  the  listener  (Figueiredo,  1976). 
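
The tuning idea can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the model fitted in the classification study. It assumes two dimensions (Tempo and Quality), equal-variance Gaussian perceptual noise, a limited capacity expressed as weights that sum to one, a perceptual standard deviation on each dimension that is inversely proportional to its weight (Footnote 2 states only that the two are inversely related), and an independent binary decision on each task-relevant dimension. The names `p_correct`, `best_weights`, and `base_sd` are hypothetical.

```python
import math


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def p_correct(weights, separations, relevant, base_sd=1.0):
    """Probability correct when each relevant dimension requires one
    binary decision and the decisions are independent (assumption).

    The perceptual SD on a dimension is assumed to be base_sd / weight,
    so z = separation / (2 * SD) = separation * weight / (2 * base_sd).
    A weight of zero gives z = 0, i.e. chance (0.5) on that dimension.
    """
    p = 1.0
    for w, d, used in zip(weights, separations, relevant):
        if used:
            p *= phi(d * w / (2.0 * base_sd))
    return p


def best_weights(separations, relevant, steps=100):
    """Grid search over (w_tempo, w_quality) with w_tempo + w_quality = 1,
    returning the weights that maximize probability correct."""
    best = None
    for i in range(steps + 1):
        w_tempo = i / steps
        weights = (w_tempo, 1.0 - w_tempo)
        p = p_correct(weights, separations, relevant)
        if best is None or p > best[1]:
            best = (weights, p)
    return best


# Same stimuli (equal separations on both dimensions), different tasks.
tempo_task = best_weights((2.0, 2.0), relevant=(True, False))
both_task = best_weights((2.0, 2.0), relevant=(True, True))
```

Under these assumptions the same stimulus set yields different optimal weights for different partitions: all emphasis goes to Tempo when only Tempo defines the categories, and emphasis is split evenly when both dimensions are needed. This is one simple way in which feature emphasis could depend on the decision rule as well as on the stimuli.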

Before  closing  we  would  like  to  offer  a few  caveats.  First, 
although  our  conclusions  were  derived  from  the  findings 
summarized  above,  these  experiments  were  not  designed  to 
explicitly  test  the  issues  addressed  in  this  paper.  For  this 
reason  our  conclusions  must  be  regarded  as  tentative  and 
speculative.  No  single  experiment,  for  example,  has  enabled  us 
to  simultaneously  examine  the  effects  of  both  stimulus  structure
and  task  demands.  Similarly,  we  have  not  addressed  the  detailed 
problem  of  how  the  feature  selection  process  may  operate  on  a 
trial-by-trial  basis.  How  might  the  proposed  structural  analysis
proceed  in  an  incremental  fashion?  We  have  argued  that  in  many 
cases  auditory  distinctive  features  are  determined  largely  by  the 
stimulus  context.  What  external  and  subjective  factors  determine 
the  relevant  context  in  any  particular  situation?  Quite  clearly 
the  answers  to  these  and  other  questions  must  await  further 
investigation.  Experiments  designed  to  address  some  of  these 
issues  are  presently  underway  in  our  laboratory. 

Finally,  it  is  important  to  remember  that  we  are  not 
proposing  that  either  of  the  two  approaches  we  have  considered  is 
necessarily  "correct."  As  we  have  indicated,  these  approaches 
represent  extremes  on  a continuum  of  possible  feature  selection 
mechanisms.  Although  one  may  challenge  the  extreme  property-list 
position  on  logical  grounds  (e.g.,  Weisstein,  1973;  Uttal,
1978),  a weighted  property-list  approach  begins  to  resemble  the 
process-oriented  view  discussed  above.  Furthermore,  we  suspect 
that  it  is  impossible  to  distinguish  between  a modified 
property-list  model  and  the  process-oriented  model  using  only
psychophysical  techniques.  The  distinction  we  have  considered  is
significant  because  of  its  impact  on  theory  and,  hence,  on  the
empirical  questions  that  are  appropriate  to  ask.  A  strict
property-list  model  would  direct  us  to  search  for  evidence
regarding  invariant  auditory  feature  detectors,  whereas  a
process-oriented  model  would  have  us  look  for  common  principles
underlying  feature  extraction  across  a  wide  range  of  different
stimuli  and  tasks.  Regardless  of  the  specific  view  that 
ultimately  emerges  as  a primary  solution  to  the  feature  selection 
problem,  it  is  clear  that  future  psychological  research  in 
auditory  pattern  recognition  must  address  the  fundamental 
question  of  how  distinctive  auditory  features  are  determined. 

Footnotes

1  It  should  be  emphasized  that  Plomp  and  his  associates
(personal  communication,  1978)  did  not  refer  to  the  dimensions
extracted  in  their  perceptual  analysis  as  "features."  Rather,  as
indicated  earlier,  they  were  primarily  interested  in  determining
the  degree  of  correlation  between  the  configuration  of  stimuli  in
the  perceptual  space  and  the  steady-state  spectra  of  the  sounds.

2  In  our  original  presentation  we  actually  represent  the
tuning  process  as  an  adjustment  in  the  variability  of  stimulus 
likelihood  functions.  The  weighting  factors  discussed  here  are 
inversely  related  to  the  estimated  standard  deviations  along  each 
dimension. 